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ABSTRACT 

Background and Objectives: Whole-genome sequencing is becoming popular as a tool for understand- 
ing outbreaks of communicable diseases, with phylogenetic trees being used to identify individual 
transmission events or to characterize outbreak-level overall transmission dynamics. Existing methods 
to infer transmission dynamics from sequence data rely on well-characterized infectious periods, epi- 
demiological and clinical metadata which may not always be available, and typically require computa- 
tionally intensive analysis focusing on the branch lengths in phylogenetic trees. We sought to determine 
whether the topological structures of phylogenetic trees contain signatures of the transmission patterns 
underlying an outbreak. 

Methodology: We use simulated outbreaks to train and then test computational classifiers. We test the 
method on data from two real-world outbreaks. 

Results: We show that different transmission patterns result in quantitatively different phylogenetic tree 
shapes. We describe topological features that summarize a phylogeny's structure and find that com- 
putational classifiers based on these are capable of predicting an outbreak's transmission dynamics. 
The method is robust to variations in the transmission parameters and network types, and recapitulates 
known epidemiology of previously characterized real-world outbreaks. 

Conclusions and implications: There are simple structural properties of phylogenetic trees which, when 
combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous 
transmission and chains of transmission. This is possible using genome data alone, and can be done 
during an outbreak. We discuss the implications for management of outbreaks. 



KEYWORDS: evolutionary epidemiology; genomic epidemiology; computational modelling; machine 
learning 
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INTRODUCTION 

Whole-genome sequence data contain rich informa- 
tion about a pathogen population from which sev- 
eral evolutionary parameters and events of interest 
can be inferred. When the population in question 
comprises pathogen isolates drawn from an out- 
break or epidemic of an infectious disease, these 
inferences may be of epidemiological importance, 
able to provide actionable insights into disease 
transmission. Indeed, since 2010, several groups 
have demonstrated the utility of genome data for 
revealing pathogen transmission dynamics and 
identifying individual transmission events in out- 
breaks [1-9], with the resulting data now being used 
to inform public health's outbreak management and 
prevention strategies. To date, these reconstruc- 
tions have relied heavily on interpreting genomic 
data in the context of available epidemiological data, 
drawing conclusions about transmission events 
only when they are supported by both sequence data 
and plausible epidemiological linkages collected 
through field investigation and patient interviews. 

Given the rapidly growing interest in this new field 
of genomic epidemiology, several recent studies 
have explored whether transmission events and pat- 
terns can be deduced from genomic data alone. 
Phylogenies derived from whole-genome sequence 
data can be compared with theoretical models 
describing how a tree should look under particular 
processes; this has been done for viral sequence 
data over the past several decades [10,11]. For 
example, predicted branch lengths from sequences 
modelled using birth-death processes can be 
compared with branch lengths in trees inferred from 
viral sequence data to explore transmission patterns 
[1 ,1 2-1 4]. The field of linking properties of pathogen 
phylogenies to underlying dynamics is termed 
'phylodynamics', coined by Grenfell et al. [15]. 
Tools from coalescent theory have been adapted to 
pathogen transmission; where coalescent theory 
describes probability distributions on trees under a 
given model forthe population size, epidemiological 
versions take into account the relationship between 
pathogen prevalence (population size) as well as 
incidence [16,17]. These approaches are powerful 
but are computationally intensive and have not 
explicitly focused on another potential source of in- 
formation within a phylogeny — 'tree shape'. 

The number of different phylogenetic tree shapes 
on n leaves is a combinatorially exploding function 
of n (there are (2n - 3)(2n - 5)(2n - 7)...(5)(3)(1) 



rooted labelled phylogenetic trees, or ~10 184 trees 
on 100 tips, compared with ~10 80 atoms in the 
universe). For the increasingly large outbreak gen- 
ome datasets being obtained and analysed (390 [3] 
616 [18] and recently 1000 [19] bacterial genomes), 
the numbers of possible tree shapes are effectively 
infinite. In the homogeneous birth (Yule) model, the 
distribution of labelled histories (tree shape to- 
gether with the ordering of internal nodes in time) 
is uniform, so that there is a close relationship be- 
tween the branching times and the tree shapes [20]. 
Perhaps for this reason, tree shapes have not 
typically been seen as very informative. However, 
for bacterial pathogens, particularly those with long 
durations of carriage and variable infectious rates, 
there is variability in the infection process which is 
not captured by homogeneous models. This motiv- 
ates asking the question: does tree shape carry epi- 
demiological information? Recent work indicates 
that tree shape reveals aspects of the evolution of 
viral pathogens [13,21-24], but to date, we do not 
have methods to exploit tree shape in an analysis of 
pathogen transmission dynamics, built upon 
simulated data and validated using real-world out- 
break data. 

Host contact network structure is one of the most 
profound influences on the dynamics of an outbreak 
or epidemic, and outbreak management and control 
strategies depend heavily upon the type of trans- 
mission patterns driving an outbreak. It is reason- 
able to expect that pathogen genomes spreading 
over different contact network structures — chains, 
homogenous networks, or networks containing 
super-spreaders, as illustrated in Fig. 1 — would 
accrue mutations in different patterns, leading to 
observably different phylogenetic tree shapes. We 
therefore characterized the structural features of 
phylogenetic trees arising from the simulated evolu- 
tion of a bacterial genome as it spreads over multiple 
types of contact network. We found simple topo- 
logical properties of phylogenetic trees that, when 
combined, can be used to classify trees according 
to whether the underlying process is chain-like, 
homogenous, or super-spreading, demonstrating 
that phylogenetic tree structure can reveal transmis- 
sion dynamics. We use these properties as the basis 
for a computational classifier, which we then use to 
classify real-world outbreaks. We find that the com- 
putational predictions of each outbreak's overall 
transmission dynamics are consistent with known 
epidemiology. 
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Figure 1. Schematic illustration of different kinds of transmission networks. The index case is marked in grey. 



MATERIALS AND METHODS 

Transmission model 

We simulated disease transmission networks with 
three different underlying transmission patterns: 
homogeneous transmission, transmission with a 
super-spreader and chains of transmission. Each 
simulation started with a single infectious host 
who infects a random number of secondary cases 
over his or her infectious period; each secondary 
case infects others, and so on, until the desired 
maximum number of cases is reached. The models 
share two key parameters: a transmission rate /3 and 
a duration of infection parameter D. Our baseline 
values are |3 = 0.43 per month and D = 3 months, 
reflecting a basic reproduction number of 1 .3. This is 
also the mean number of secondary infections for 
each infectious case. We do not consider depletion 
of susceptible contacts over time (saturation) as we 
model small growing outbreaks at or near the 
beginning of their spread in a community, and our 
data (for tuberculosis (TB) in a developed setting) 
suit this assumption. 

The homogeneous transmission model assigns 
each infectious host a number of secondary infec- 
tions drawn from a Poisson distribution with param- 
eter Rq = |3D. New infections are seeded uniformly 
in time over the host's infectious period. In the 
super-spreader model, one host (at random in 
the first five hosts) seeds 7-24 new infections (uni- 
formly at random), and all other hosts are as in 
the homogeneous transmission model. In the 
chain-of-transmission model, almost all hosts infect 
precisely one other individual. However, 2 (with 



probability 2/3) or 3 (with probability 1/3) of the 
hosts infect two other individuals, so that the 
transmission tree consists of several chains of 
transmission randomly joined together. 

Durations of infection are drawn from a V dis- 
tribution with a shape parameter of 1.5 and a 
scale parameter of D/1.5. To reflect transmission 
of a chronically infecting pathogen, such as 
Mycobacterium tuberculosis, cases were infectious 
for between 2 and 14 months with an average 
specified by D. The mean infectious period was 4.3 
months; a histogram is shown in supplementary Fig. 
S2. We simulated 1000 outbreaks containing a 
super-spreader, 1000 with homogeneous transmis- 
sion and 1000 chain-like outbreaks. These used a 
fixed parameter set; we also performed a sensitivity 
analysis using alternative parameters. To ensure 
that the size of the outbreak did not affect the tree 
shape and classification, we simulated outbreaks 
with 32 hosts — a similar size as the real-world out- 
breaks we later investigated. We consider the effects 
of phylogenetic noise in the Supplementary 
Material. 



Genealogies and phylogenies from the process 

We extracted the true genealogical relationships as a 
full rooted binary tree (a 'phylogeny'), with tips cor- 
responding to hosts and internal nodes correspond- 
ing to transmission events among the hosts, as 
follows. The outbreak simulations create lists of 
who infected whom and at what time. Each host 
also has a recovery time. We sort the times of all of 
the infection events, and proceed in reverse order. 
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The last infection event must correspond to a 
'cherry', i.e. it must have two tip descendants, one 
corresponding to the infecting host and one to the 
infectee. For all other infection events proceeding in 
reverse orderthrough the transmission, we create an 
internal node, and determine its descendants by 
determining whether the infector and the infectee 
went on to infect anyone else subsequently. If not, 
then the node's descendants are the infector and 
infectee at the time of sampling. If so, then the des- 
cendant represents the infector or infectee at the 
time of their next transmission. The tree is rooted 
at the first infection event. Branch lengths corres- 
pond to the times between infection events or, for 
tips, the time between the infection event and the 
time of sampling. This approach uses the simplify- 
ing assumption that branching points in the patho- 
gen genealogy correspond to transmission events, 
as is done in almost all phylodynamic methods (see 
[1 ,1 7,24]) However, where there is in-host pathogen 
diversity transmission events do not correspond to 
phylogenetic branching points [7,9]. We comment 
on the constraints tree shape places on the space 
of possible transmission trees consistent with a 
phylogeny in [9]. 

In the main text, we use the true genealogical 
relationships among the hosts in our outbreak, ex- 
tracted from the simulations — this reduces phylo- 
genetic noise and it allows us to compare the 
resulting trees to 1000 samples of the BEAST pos- 
terior timed phylogenies derived from whole- 
genome sequence (WGS) data from the two real- 
world outbreaks. To determine how sensitive our 
approach is to phylogenetic noise, we also classified 
the outbreaks using neighbour-joining phylogenies 
derived from simulated gene sequences 
(Supplementary Information). 

Topological summaries of trees 

Eleven summary metrics were used to summarize the 
topology of the trees (see supplementary Table SI). 

(1) Imbalance. The Colless imbalance [25] is 
defined as (w _ 1) 2 (w _ 2) Y,i=l \ J n ~ where n 
is the number of tips and T ri and 7// are the 
number of tips descending from the left and 
right sides at internal node /. It is a 
normalized measure of the asymmetry of a 
rooted full binary tree, with a completely 
asymmetric tree having imbalance of 1 and 
a symmetric tree having an imbalance of 0 



[26]. The Sackin imbalance [27] is the average 
length of the paths from the leaves to the 
root of the tree. 

(2) Ladders, IL nodes. We define the ladder 
length' to be the maximum number of con- 
nected internal nodes with a single leaf des- 
cendant, and we divide it by the number of 
leaves in the tree. This measure is not unre- 
lated to tree imbalance but is more local — a 
long ladder motif may occur in a tree that is 
otherwise quite balanced. For this reason, lad- 
der length may detect trees in which there has 
been differential lineage splitting in some 
clades or lineages, but where this occurred 
too locally or in clades that are too small have 
affected traditional approaches in charac- 
terizing rapid expansion in some lineages. 
Furthermore, traditional ways of detecting 
positive selection may not be appropriate in 
this context because the super-spreader, if pre- 
sent, does not pass any advantageous prop- 
erty to descendant infections. The portion of 
'IL' nodes is the portion of internal nodes with 
a single leaf descendant. 

(3) Maximum width; maximum width over max- 
imum depth. The 'depth' of a node in a tree is 
the number of edges between that node and 
the tree's root. The 'width' of a tree at a depth 
d is defined as the number of nodes with depth 
d. We calculated the maximum width of each 
tree divided by its maximum depth (max d, the 
maximum depth of any leaf in the tree). 

(4) Maximum difference in widths. We compared 
Aw = max \{\w(di) — in the trees. 
This summary reflects the maximum absolute 
difference in widths from one depth to the 
next, over all depths d\ in the tree. 

(5) Cherries. A cherry configuration is a node 
with two leaf descendants. 

(6) Staircase-ness. We use two measures of the 
'staircase-ness' of phylogenies defined by 
Norstrom et al. [21]: (i) the portion of sub- 
trees that are imbalanced (i.e. that have dif- 
ferent numbers of descending tips on the left 
and right sides) and (ii) the average of min ( 
Tn, T r /)/max (T\iJ r ) over the internal nodes of 
the tree. 

Outbreak classification 

We trained k-nearest-neighbour (KNN) classifiers 
using matlab's ClassificationKNN.fit function 
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with a Minkowski distance, inverse distance 
weighting and 100 neighbours. KNN classification 
was performed on 1000 trees of each type (homoge- 
neous transmission, super-spreaders, chains) using 
10-fold cross-validation. The 10 resulting classifiers 
were then used to classify the groups of simulations 
in the sensitivity analysis, allowing us to report on 
the variability of classification results. KNN classifi- 
cation is suitable for sets of data that have any num- 
ber of groups. Here, there were three groups: 
homogeneous outbreaks, super-spreader outbreaks 
and chains of transmission. KNN classifiers' quality 
can be assessed with a table reporting how many in 
each group are correctly classified, and how many 
are classified into which incorrect group. 
Alternatively, the quality can be summarized by re- 
porting the portion of each group that is classified 
correctly. 

When there are only two groups to compare, so 
that classification is binary, better methods are 
available. One of the most powerful of these is the 
support vector machine (SVM) approach. We used a 
10-fold cross-validated SVM to resolve differences 
between homogeneous transmission versus super- 
spreader networks. Because SVMs are binary classi- 
fiers, their quality can be assessed by reporting the 
sensitivity (portion of true positives that are 
classified as positive) and specificity (portion of true 
negatives that are classed as negatives) of the pre- 
dictions. The sensitivity and specificity of a classifier 
trade off with each other, because it is always pos- 
sible to classify all cases as positive (sensitivity 1 but 
specificity 0) or all as negative (specificity 1 but sen- 
sitivity 0). Classifiers use a cutoff, calling a data point 
positive if the cutoff is above some threshold, and 
negative otherwise. The overall quality of a binary 
classifier can be visualized using a receiver operator 
characteristic (ROC) curve, which captures the 
change in sensitivity and specificity of a classifier 
when its threshold is changed. See Cristianini and 
Shawe-Taylor [28] for a full discussion of SVMs and 
classification. 

Here, SVMs were constructed usingthe SVMtrain 
method in matlab with a linear kernel function. 
The training data x-, in the /th 'fold' were the 11 
summary metrics for 900 trees derived from each 
process. The test data were the remaining 100 
trees. This was done 10 times (10 'folds' of cross- 
validation). All training data were from simulations 
with the baseline set of parameters. The 10 
SVMs (one for each 'fold') were tested on the 



remaining trees using matlab's SVMclassify, which 
computes 

sign(y) = sign a/k(x/, x) + bj 

where oc, are weights, x, are the support vectors, x is 
the input to be classified, Zeis the kernel function and 
b is the bias. These tests were done separately on the 
different groups of simulated trees. The SVMclassify 
function was modified to return y (i.e. the degree to 
which an outbreak could be considered super- 
spreading) rather than only the sign of y (a binary 
prediction). We have also performed 10-fold SVM 
classification in R using the el 071 package. 
Classifiers are available along with a script to profile 
the structure of a tree in newick format, using 
the phyloTop package [29] (see Supplementary 
Information). 

Sensitivity analysis 

To determine whether the classifier is robust to 
different choices of model parameters and to 
sampling, we simulated three groups of 500 homo- 
geneous and super-spreader outbreaks with (i) 
randomly selected parameters, (ii) a random 
sampling density and (iii) both random parameters 
and random sampling. Group (i) had randomized 
parameters in which p/D was uniformly distributed 
between 1.25 and 2.5. Group (ii) had fixed param- 
eters, but the number of cases varied uniformly 
between 100 and 150, and we sampled only 33 of 
those cases. The third group had both randomized 
parameters and random sampling. 

To ensure that the classification is detecting 
variability in the number of secondary cases (i.e. 
super-spreading), we performed classification on 
outbreaks in which we used a negative binomial dis- 
tribution to determine the numbers of secondary 
cases in the outbreak. We varied the parameters 
of the distribution such that the mean number of 
secondary cases was the same (R 0 ) but the variance 
differed, with the expected variance ranging from 
two to five times what it would be in the Poisson 
(homogeneous transmission) case. We classified 
the outbreaks using the 1 0 SVM classifiers obtained 
under 10-fold cross-validation on the baseline case. 
We report the mean and standard deviation of the 
specificity, i.e. the portion of cases correctly classi- 
fied as 'super-spreader' outbreaks. 

To determine whether the classifier is relevant to 
different kinds of models, we applied it to simulated 
phylogenies described in Robinson etal. [22]. In that 
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work, dynamic networks of sexual contacts were 
created based on random graphs with a Poisson dis- 
tribution, and with a distribution of contacts derived 
from the National Survey on Sexual Attitudes and 
Lifestyles (NATSAL) [30]. See Supplementary mater- 
ial for further details. 

Classification of outbreaks from published 
genomic data 

We used the classifier on phylogenetic trees derived 
from two real-world tuberculosis outbreak datasets. 
Outbreak Awas previously published Gardy etal. [31] 
and is available in the NCBI Sequence Read Archive 
under the accession number SRP002589. This 
dataset comprises 31 M. tuberculosis isolates col- 
lected in British Columbia over the period 1995- 
2008 and was sequenced using paired-end 50 bp 
reads on the lllumina Genome Analyzer II. 
Outbreak B comprises 33 M. tuberculosis isolates col- 
lected in British Columbia over the period 2006-11, 
and was sequenced using paired-end 75 bp reads on 
the lllumina HiSeq. The outbreak, sequences and 
single nucleotide polymorphisms (SNPs) are pre- 
sented in Didelot et al. [9]. 

For both datasets, reads were aligned against 
the reference genome M. tuberculosis CDC1551 
(NC002755) using Burrows-Wheeler Aligner [32]. 
Single nucleotide variants were identified using 
samtools mpileup [33] and were filtered to remove 
any variant positions within 250 bp of each other and 
any positions for which at least one isolate did not 
have a genotype quality score of 222. The remaining 
variants were manually reviewed for accuracy and 
were used to construct a phylogenetic tree for each 
outbreak as described above. We apply the classifi- 
cation methods to 1000 samples from the BEAST 
posterior timed phylogenies estimated from WGS 
data using a birth-death prior. 

RESULTS 

Tree shapes capture transmission patterns 

Different transmission networks result in 
quantitatively different tree shapes 

To determine whether tree shapes captured informa- 
tion about the underlying disease transmission 
patterns within an outbreak, we simulated evolution 
of a bacterial genome over three types of outbreak 
contact network — homogenous, super-spreading 
and chain — and summarized the resulting 
phylogenies with five metrics describing tree shape. 
Figure 2 and 3 illustrate the distributions of these 



metrics across the three types of outbreaks, reveal- 
ing clear differences in tree topology depending on 
the underlying host contact network. Super-spreader 
networks gave rise to phylogenies with higher 
Colless imbalance, longer ladder patterns, lower 
Aw and deeper trees than transmission networks 
with a homogeneous distribution of contacts. 
Trees derived from chain-like networks were less 
variable, deeper, more imbalanced and narrower 
than the other trees. Other topological summary 
metrics considered did not resolve the three out- 
break types as fully (Supplementary Information). 

Classification on the basis of tree shape 

Topological metrics can be used to 
computationally classify outbreaks 

To evaluate whether the five topological summary 
metrics could realiably and automatically differenti- 
ate between the three types of outbreaks, we trained 
a series of computational classifiers on the 
simulated datasets. We first trained a KN N classifier 
using the 1 1 tree features to discern which combin- 
ations of features correspond to phylogenies 
derived from the three underlying transmission 
processes. The KNN classifiers correctly 
identified the underlying transmission dynamics 
well (see Table 1), with an average of 89 (0.03)% 
of the homogeneous outbreaks, 86 (0.05)% of the 
super-spreader outbreaks and 100% of the chain 
outbreaks correctly classified under 10-fold cross- 
validation. Mis-classifications were between the 
homogeneous and super-spreader outbreaks. 

SVM improves classification accuracy 

To better resolve the separation between super- 
spreader-type outbreaks and those with homoge- 
neous transmission, we trained a SVM classifier to 
distinguish between those two types of outbreaks 
alone. Figure 4a shows the ROC curve for an SVM 
classification trained on 300 of the 1000 simulated 
homogeneous and super-spreader outbreaks. The 
area under the curve (AUC) is 0.97, reflecting a very 
good classifier performance; the theoretical max- 
imum AUC is 1, and 0.5 corresponds to random 
guessing. We performed 10-fold cross-validation, 
each time training a new SVM on 900 ofthe 100 trees 
and testing it on the remaining 100. The average 
sensitivity was 0.93 and the average specificity was 
0.89; the average AUC was 0.98. These values are 
listed in Table 1. 
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Schematic illustrating several tree structures Ladder length versus imbalance. Colors correspond 

to different underlying transmission patterns. 



Figure 2. Distribution of simple summary measures of tree topology 



Table 1. Results of cross-validated classification 



KNN 


Horn 


SS 


Chains 


Baseline 


0.89 (0.03) 


0.85 (0.05) 


1.00 (0) 


Varied parameters 


0.90 (0.01) 


0.89 (0.01) 


1.00 (0) 


Varied sampling 


0.98 (0.01) 


0.21 (0.01) 


1.00 (0) 


Varied both 


0.98 (0.01) 


0.20 (0.01) 


1.00 (0) 


10 isolates 


0 (0.001) 


0.5 (0.5) 


0.5 (0.53) 


20 isolates 


0.99 (0.01) 


0.01 (0.001) 


0 (0) 


SVM 


Spec. 


Sens. 


AUC 


Baseline 


0.93 (0.05) 


0.89 (0.05) 


0.98 (0.01) 


Varied parameters 


0.92 (0.05) 


0.92 (0.04) 


0.98 (0.01) 


Varied sampling 


0.79 (0.07) 


0.76 (0.05) 


0.86 (0.003) 


Varied both 


0.83 (0.07) 


0.74 (0.06) 


0.87 (0.002) 


10 isolates 


1 (0) 


0 (0) 


0.61 (0.07) 


20 isolates 


0.93 (0.13) 


0.21 (0.34) 


0.78 (0.07) 



Sensitivity (the true negative rate) here is the portion of homogeneous outbreaks correctly classified as homogeneous, 
and specificity (true positive rate) is the portion of super-spreader outbreaks correctly classified. For SVM classification, 
sensitivity and specificity have a trade-off, such that greater sensitivity can be achieved at the cost of reduced specificity 
and vice versa. Sensitivity and specificity are computed with the optimal threshold returned by matlab's perfcurve 
function. The AUC captures the overall classifier quality. For KNN classification we report the portion correct by 
outbreak type, as there are three types. Numbers shown are mean (standard deviation) using the 10 classifiers found 
with 10-fold cross validation of the baseline case. 



Effects of the extent of super-spreading, 
sampling, and early classification 

Outbreak classification is robust to variable 
parameters and model choice, but not to 
sampling 

To explore how robustly phylogenetic structure cap- 
tures variation in transmission processes, we 



performed sensitivity analyses in which we explored 
the effect of varying the transmission parameters 
P/D, sampling and both the parameters and 
sampling together. 

Using the KNN classifier applied to the three 
outbreak types, we found that the overall classi- 
fier error remained at ~10% when the 
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Figure 3. Box plots of the features used to summarize the shapes of phylogenies 



transmission rate varied up to a factor of 2 
(Table 1). The effect of reduced sampling density 
was much greater, and while the portion of 
homogeneous outbreaks correctly predicted was 
high (98%), the error was high because only 21% 
of super-spreader outbreaks were correctly classi- 
fied. Mis-classification was between these two 
outbreak types, and chains of transmission were 
always correctly classified. Varying both the 



sampling and the parameters decreased the qual- 
ity of the predictions slightly. 

We also evaluated the sensitivity of SVM classifi- 
cation to different transmission model parameters 
by training and testing an SVM on a further 500 
simulated super-spreading, and homogenous net- 
works with variable transmission parameters (3/D. 
As with the baseline parameter networks, the SVM 
returned an AUC ofO. 98 forthe variable p/D groups, 
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Standard deflation / mean of ttcondiry case nunbsrs Fa Is* pes it ivm fraction False positive fraction 

Sensitivity as variability in secondary ROC with sensitivity analysis ROC with early isolates 

cases numbers increases groups 

Figure 4. (a) Sensitivity oftheSVM classification increases as the variability inthe number of secondary cases in the outbreak increases. Variability is quantified as 
the ratio of the standard deviation to the mean of the numbers of secondary cases caused by an infectious case. Sensitivity is the portion of simulated outbreaks 
with the corresponding variability that were classed as super-spreader outbreaks; the solid line shows the mean sensitivity over the 10 SVMs produced by cross- 
validation and dotted lines are the mean + the standard deviation, (b and c) ROCs for the SVM classifier based on the 1 1 summary metrics describing tree shape. 
ROC curves area visual way to assess the classifier's quality — a perfect classifier will obtain all the true positives and will have no false positives, giving an AUC of 1 . 
An imperfect classifier has a trade-off, and can attain a specificity (true positive rate) of 1 at the cost of having a false-positive rate of 1 (top right corner ofthe plot). 
The ROC curve illustrates the shape of this trade-off; the higher the AUC, the higher the quality ofthe classifier. Guessing yields an AUC of 0.5. In b, different lines 
correspond to the different groups of simulations in the SVM sensitivity analysis. Panel c shows the SVM classifier's performance when only the earliest outbreak 
isolates are sampled. Performance is poor with 10 isolates (black line) and better with 20 (blue line) 



though the sensitivity and specificity were both 
slightly reduced (0.92; see Table 1). However, the 
SVM's performance declined with decreased 
sampling density (AUC of 0.86; sensitivity 0.76 and 
specificity 0.79), and decreased sampling with vari- 
able transmission parameters (AUC of 0.87, sensi- 
tivity 0.74 and specificity 0.83). Notably, the decline 
in performance was much less with the SVM method 
than the KNN method. Figure 4b shows the ROCs 
for SVM classification on these groups. 

The decline in performance due to lower sampling 
density occurs for two primary reasons. The super- 
spreaders are relatively rare; ifthey were not, then the 
outbreak would not really be a 'super-spreader' out- 
break, but one with a higher rate of transmission 
overall. When sampling density is reduced, there is 
therefore a good chance that the super-spreader in- 
dividuals are not sampled. In addition, under weak 
sampling, only afew of a super-spreader's secondary 
cases would be sampled. Both of these factors act to 
reduce the ability of the genealogy to capture super- 
spreading. Under very low sampling densities, it is 
likely that the probability of a given tree approaches 
what it would be under the homogeneous birth- 
death model or the appropriate coalescent model, 
even where the infectious period is not memoryless. 
Though we have not shown this, very low sampling 



should reduce the asymmetry that arises from one 
lineage continuing in the same host and another 
continuing in a new host, because each lineage 
would be expected to change hosts multiple times 
along a branch under low sampling densities. 
Accordingly, if sampling density is low enough that 
coalescent methods are appropriate, they may be 
used to relate branching times and some aspects 
of tree shapes to epidemic models [24]. 

We varied the extent of heterogeneity in the num- 
bers of secondary cases in our outbreaks, using a 
negative binomial distribution and varying its par- 
ameters. We found that the classifier (trained on 
outbreaks each with a single super-spreader but with 
varying secondary case numbers) had a high sensi- 
tivity of classification (>0.7) when the ratio of the 
standard deviation to the mean of the secondary 
case number distribution was 2 or more. Figure 4a 
shows the average sensitivity increasing with the 
variability in secondary case numbers. 

We tested the SVM classifiers to determine 
whether they could distinguish between phylogen- 
etic trees derived from simulated sequence trans- 
mission on different contact networks, namely 
dynamical models of sexual contact networks over 
a 5-year simulated time period [22]. The performance 
was good when sampling was done over time, such 
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that cases infected early in the simulation were likely 
to be sampled. When sampling was done at one 
time, years after seeding the simulated infection, 
neither classifier detected differences between the 
two types of contact network. Details are presented 
in the Supplementary Information. 

Outbreak classification is possible using early 
isolates only 

To determine whether classification of an outbreak 
is possible early in an outbreak — information that 
could potentially inform real-time deployment of a 
specific public health response — we evaluated the 
10 KNN and 10 SVM classifiers' performance when 
only the first 1 0 and first 20 genomes ofthe outbreak 
were sampled (1 0 of each, constructed using 1 0-fold 
cross-validation). The KNN performed poorly on the 
first 1 0 isolates, with none of the homogeneous out- 
breaks correctly classified, and only 50% of the 
others. Mis-classifications were between the super- 
spreader and chain outbreaks. After 20 isolates had 
been sampled, KNN classifiers grouped all out- 
breaks with homogeneous transmission. The SVM 
had AUC values of 0.61 and 0.78 after 10 and 20 
isolates were detected, respectively (see Table 1 
and Fig. 4c), although the optimal cutoffs gave 
low-sensitivity values. These data suggest that 
SVM classification can give some information about 
an outbreak's transmission dynamics at early points 
within the outbreak. 

Real-world outbreaks 

Topological metric-based classification 
recapitulates known epidemiology of real-world 
outbreaks 

Finally, to evaluate the classifiers' performance on 
real-world outbreaks with known epidemiology, 
we applied the classifier to genome sequence data 
from two tuberculosis outbreaks whose underlying 
transmission dynamics have been described 
through comprehensive field and genomic epidemi- 
ology. Outbreak A [31] was reported to arise from 
super-spreading activity, while Outbreak B displayed 
multiple waves oftransmission, resulting in a some- 
what more homogenous network. 

We found that our classification results agreed 
with the empirical characterizations of the two out- 
breaks' underlying transmission dynamics. In the 
KNN classification, Outbreak A was grouped with 
super-spreader outbreaks most often (56(0.5)%), 
with 44% ofthe posterior trees grouping with homo- 
geneous outbreaks none with chains. 77(0.7)% of 



the trees from Outbreak B were classed as homoge- 
neous, with the other 23% classed with super- 
spreader outbreaks. As above, numbers in 
parentheses are standard deviations over the 10 
classifiers from the 10-fold cross-validation. The 
SVM classification grouped 75(8)% of BEAST pos- 
terior Outbreak A trees with super-spreaders, and 
76(9)% of Outbreak B trees with homogeneous 
transmission. We also applied the classifiers to the 
maximum clade credibility (MCC) trees for the two 
outbreaks; the MCC tree from Outbreak A grouped 
with super-spreaders and that from B grouped with 
homogeneous outbreaks in all of the 10 cross- 
validated classifications. Thus both classifiers' pre- 
dictions agree with epidemiological investigations 
of the outbreak, using tree shapes alone to classify 
transmission patterns. 

DISCUSSION 

We have found that there are simple topological 
properties of phylogenetic trees which, when 
combined, are informative as to the underlying 
transmission patterns at work in an outbreak. 
Tree structures can be used as the basis of a clas- 
sification system, able to describe an outbreak's 
dynamics from genomic data alone. These topo- 
logical signatures are robust to variation in the 
transmissibility, and to the nature and structure 
of the model, but sampling has a detrimental 
effect on the strength of the signal. Signs of the 
underlying transmission dynamics are present 
within the first 20 genomes sampled from an out- 
break, and the classifiers are able to recapitulate 
known, real-world epidemiology from actual out- 
break datasets. 

The relationship between host contact heterogen- 
eity and pathogen phylogenies is complex. In large 
datasets, phylogenetic branch lengths can reveal 
heterogeneous contact numbers [12], but distribu- 
tions of branch lengths are not a suitable tool for 
small outbreaks of a chronically infecting and slowly 
mutating organism like TB. Early work made the 
assumption that heterogeneous contact numbers 
would yield heterogeneous cluster sizes in viral 
phylogenies [34]. But cluster sizes also depend on 
the pathogen population dynamics [22] and the epi- 
demic dynamics [24]. The relationship between het- 
erogeneous contact numbers and tree imbalance 
[13] is not robust to the dynamics of a contact net- 
work [22], sampling [22,24] or the epidemic model 
used [24]. It is clear from this body of work that 
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increased heterogeneity in contact numbers will not 
always lead to a simple increase or decrease of some 
measure, like imbalance, of tree structure. However, 
we have found that in small outbreaks, several sim- 
ple topological features, taken together, can distin- 
guish between outbreaks with high heterogeneity (a 
super-spreader) and low heterogeneity. 

In any modelling endeavor, when a model repro- 
duces features of real data — whether those are tree 
structures, branch lengths or other data such as 
prevalence and incidence of an infection, locations 
of cases and so on — it remains possible that there 
are processes not included in the model that are the 
real origin of the observations. When we use models 
to interpret data, we use formal or informal priors to 
weigh the likelihoods ofthe assumptions behind the 
model when compared with other processes that 
could drive the same phenomena. Here, one aspect 
ofthe complex relationship between contact hetero- 
geneity and phylogeny structure is illustrated by the 
fact that genealogies from a long chain of transmis- 
sion can look similar to genealogies derived from a 
super-spreader. Indeed, if one individual infects 10 
others over a long period, and none of those infects 
anyone else, the genealogy among isolates would 
look the same as a genealogy in which each case 
infected precisely one other. However, it is unlikely 
that such a chain of cases would occur, with no 
one 'ever' infecting two others rather than one. 
Similarly, it is unlikely that one host could infect 
everyone in an outbreak, with no onward transmis- 
sion by anyone else. In our simulations, once the 
occasional person in a long chain can infect two 
others, and if non-super-spreader individuals infect 
others homogeneously, we find that simple 
topological structures are well able to resolve the 
differences between chains and super-spreader 
outbreaks. 

We have used 1 1 coarse and simple summaries of 
tree topology. However, any small set of a few sum- 
mary statistics cannot capture the topology with 
much resolution. In contrast, most methods to com- 
pare phylogenies in fine detail are suited only for 
phylogenies on the same sets of tips [35], and so 
cannot be used to compare different outbreaks or 
to compare simulations to data. Finding the correct 
balance to summarize trees sufficiently that 
they can be compared across different tree sizes, 
different outbreaks and different settings, without 
summarizing them so much as to remove the most 
useful information is a challenge, and a number of 
methods will likely be developed, beginning with 



viral pathogens as in the recent work on Poon et al. 
[23]. Indeed, although we feel that the measures 
we have used are demonstrative that tree 
structure is revealing, they are not intended to be 
comprehensive or exhaustive descriptions of tree 
topology. The fact that a few simple topological 
summaries can reveal underlying transmission pat- 
terns is a proof-of-principle that tree shape is 
informative. 

We have taken a different approach than has 
recently been taken in a number of studies aiming 
to infer transmission trees from phylogenetic data 
[7,9,36,37], or to identify or at least rule out trans- 
mission events based on epidemiological and gen- 
etic data [2-6]. These methods use the timing of case 
presentation (and estimated times of infection) to 
help determine who infected whom. In contrast, in 
pathogens with long and variable infectious dur- 
ation, the timing of case presentation does not pro- 
vide much information aboutthe timingof infection. 
In this setting, even whole-genome sequence data 
may not contain sufficient information to clearly 
characterize individual transmission events, as we 
have recently found [9]. However, individual trans- 
mission events are often of interest mainly because 
they reveal 'patterns' of transmission. When we 
reconstruct an outbreak we are not seeking to 
determine whether case C will infect case D in the 
next outbreak, but rather, to find sufficient informa- 
tion about how the outbreak occurred that pub- 
lic health practices can benefit. Here, we have 
found that tree shapes can reveal overall patterns 
of transmission without first inferring who infected 
whom. 

The classification method we have developed 
provides not only an important empirical quan- 
tification of the degree to which genomic data is 
informative in the absence of epidemiological infor- 
mation, but is also a useful tool that can be used to 
describe outbreaks both retrospectively and pro- 
spectively. The ability to situate an outbreak on 
the spectrum from homogeneous transmission to 
super-spreading and to do so within the earliest 
stages of an outbreak when neither a large num- 
ber of specimens nor detailed epidemiological in- 
formation is available represents an important 
opportunity for public health investigations. 
Situating an outbreak on this spectrum does not 
require pinning down individual transmission 
events, but relies more on characterizing sum- 
mary features ofthe outbreak and/or its phylogeny. 
If the data point towards a significant role for super- 
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spreading in an outbreak, a containment strategy 
will require intensive screening of the super- 
spreader's contacts. In an outbreak where onward 
transmission is occurring in chains, a focus on ac- 
tive case finding around multiple individuals will be 
needed instead. Ultimately, investigation ofany out- 
break of a communicable disease will involve the 
collation of multiple sources of information, 
including epidemiological, clinical and genomic 
data. The approach described here represents one 
part ofthis toolbox, and has the advantages of being 
robust to the unique nature of complex chronic in- 
fection, providing useful information even when 
epidemiological information is incomplete, and 
being informative within the earliest stages of an 
outbreak. 

SUPPLEMENTARY DATA 

Supplementary data is available at EMPH online. 
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